Convergence issue of gradient descent
Backpropagation
- The divergence function minimized is only a proxy for classification error(like Softmax)
- Minimizing divergence may not minimize classification error
- Does not separate the points even though the points are linearly separable
- This is because the separating solution is not a feasible optimum for the loss function
- Compare to perceptron
- Perceptron rule has low bias(makes no errors if possible)
- But high variance(swings wildly in response to small changes to input)
- Backprop is minimally changed by new training instances
- Prefers consistency over perfection(which is good)
Convergence
Minimize E=21aw2+bw+c
w(k+1)=w(k)−ηdwdE(w(k))
- Gradient descent with fixed step size η to estimate scalar parameter w
- Using Taylor expansion
E(w)=E(w(k))+E′(w(k))(w−w(k))+E′′(w(k))(w−w(k))2
- So we can get the optimum step size ηopt=E′′(w(k))−1
- For η<ηopt the algorithm will converge monotonically
- For 2ηopt>η>ηopt, we have oscillating convergence
- For η>2ηopt, we get divergence
- For generic differentiable convex objectives
- also can use Taylor expansion to estimate
- Using Newton's method
ηopt=(dw2d2E(w(k)))−1
- Quadratic convex function
E=21wTAw+wTb+c
E=21i∑(aiiwi2+biwi)+c
- We can optimize each coordinate independently
- Like η1,opt=a11−1, η2,opt=a22−1
- But Optimal learning rate is different for the different coordinates
- If updating gradient descent for entire vector, need to satisfy
η<2iminηi,opt
- This, however, makes the learning very slow if miniηi,optmaxiηi,opt is large
- Solution: Normalize the objective to have identical eccentricity in all directions
- Then all of them will have identical optimal learning rates
- Easier to find a working learning rate
- Target
E=21wTw+b^Tw+c
- So let w=Sw, and S=A0.5, b^=A−0.5b ,w=A0.5w
- Gradient descent rule
w(k+1)=w(k)−η∇wE(w(k))T
w(k+1)=w(k)−ηA−1∇wE(w(k))T
So we just need to caculate A−1, and the step size of each direction is all the same(1)
For generic differentiable multivariate convex functions
Also use Taylor expansion
E(w)≈E(w(k))+∇wE(w(k))(w−w(k))+21(w−w(k))THE(w(k))(w−w(k))+⋯
We get the normalized update rule
w(k+1)=w(k)−ηHE(w(k))−1∇wE(w(k))T
Use quadratic approximations to get the maximum
Issues
Hessian
- For complex models such as neural networks, with a very large number of parameters, the Hessian is extremely difficult to compute
- For non-convex functions, the Hessian may not be positive semi-definite, in which case the algorithm can diverge
Learning rate
- For complex models such as neural networks the loss function is often not convex
- η>2ηopt can actually help escape local optima
- However always having η>2ηopt will ensure that you never ever actually find a solution
- Using Decaying learning rate